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Abstract 


In  this  paper  ve  consider  the  design  of  optimization  algorithms  to  run 
efficiently  on  highly  parallel  computer  architectures.  Most  efficient 
optimization  algorithms  require  the  calculation  of  first  and  possibly 
second  derivatives  of  the  objective  function  and  where  present  the 
constraints.  This  task  normally  dominates  all  other  tasks  undertaken  in 
solving  a  large  sized  problem.  The  other  main  task  is  usually  the  solution 
of  a  set  of  linear  equations. 

In  this  paper  ve  describe  our  experience  of  solving  these  two  tasks  on 
parallel  computer  architecture.  A  sparse  forward  implementation  of  doublet 
and  triplet  automatic  differentiation  is  described  that  enables  both  the 
gradient  and  hessian  of  objective  functions  to  be  accurately  and  cheaply 
solved.  It  is  shown  that  when  the  function  is  partially  separable  this  can 
be  performed  very  effectively  on  a  parallel  machine. 

The  effect  of  solving  sets  of  linear  equations  on  a  parallel  system  is  also 
described,  and  the  two  then  combined  in  effective  algorithms  for  both 
unconstrained  and  constrained  optimization. 
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programming,  parallel  optimization,  truncated  Newton,  preconditioned 
conjugate  gradients. 
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1.  Introduction 


For  many  years  the  scale  of  optimization  problems  that  can  be  solved 
has  been  limited  by  the  sequential  nature  of  available  computers.  The 
availability  of  highly  parallel  architectures  vill  significantly 
increase  the  size  of  problems  that  can  be  solved.  Indeed  this  is 
already  true  of  the  limited  parallel  architecture  systems  now 
available.  The  solution  of  such  problems  is  of  course  necessary  for 
the  SDI  programme. 

However,  because  all  existing  optimization  software  designed  before 
1983  was  essentially  designed  for  sequential  computers,  the  design  of 
parallel  algorithms  must  be  radically  different  and  so  the  initial 
search  on  algorithm  design  for  parallel  computers  can  be  undertaken  on 
relatively  small  parallel  systems. 

Simultaneously  with  the  move  towards  highly  parallel  architectures, 
the  language  ADA  has  become  available  including  the  features  of  user 
defined  data  types,  operator  overlay  and  parallel  (concurrent) 
tasking. 

The  project  aimed  to  utilise  these  features  of  ADA  to  design  efficient 
optimization  codes  for  use  on  highly  parallel  architectures. 

Specifically  the  following  four  areas  have  been  considered. 

(1)  To  investigate  the  effect  of  the  availability  of  user  defined 
data  types  and  operator  overlay  on  the  design  of  optimization 
software,  bearing  in  mind  the  ultimate  goal  of  use  on  a  highly 
parallel  computer  architecture. 

(2)  To  investigate  the  relative  efficiency  of  ordinary  direct  and 
indirect  iterative  methods  for  solving  Ax  *  b,  x  e  R"  on  a  parallel 
computer  system  with  P  processors  for  various  ratios  of  P:n.  And  to 
compare  the  efficiency  of  these  methods  with  those  based  on  interval 
arithmetic. 

(3)  To  investigate  the  design  of  algorithms  for  solving  the 
unconstrained  optimization  problem 
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(1.1) 


Min  F(x);  x  c  R" 

(4)  To  investigate  the  design  of  algorithms  for  solving  the 
constrained  optimization  problem 

Min  P(x) 

s.t.  e^(x)  1^0  i>l,...,E  xeR°  E<n 

hjCx)  >0  j  -  1,  H  (1.2) 

The  project  commenced  by  defining  the  data  types  vector  and  matrix, 
and  then  redefining  the  values  of  the  operators  +,  *,  so  that  the 

computer  automatically  performed  the  normal  operations  of  linear 
algebra. 

This  implied  that  if  the  operation 

y  «  A*x  (1.3) 

vas  requested  in  the  code,  vhere  A  was  of  type  matrix  and  x  of  type 
vector,  then  the  computer  produced  the  correct  vector  y  vith  no 
further  commands. 

It  vas  soon  realised  that  this  enabled  the  computer  code  for  the 
conjugate  gradient  algorithm  to  be  virtually  Identical  to  the 
mathematical  statement  as  all  the  "do"  loops  that  so  clutter  a 
FORTRAN  program  rapidly  disappeared. 

It  vas  immediately  apparent  that  it  vas  just  as  possible  to  define 
data  types,  "sparse  matrix"  and  "sparse  vector",  and  define  associate 
operators  so  that  the  operation 

y  =  A*x 

still  gave  the  correct  vector  and  made  full  use  of  the  sparsity  of  the 
matrix  vhen  doing  the  arithmetic. 

Codes  vritten  in  this  vay  are  far  easier  to  understand  and  check  than 
the  equivalent  codes  vritten  in  FORTRAN. 
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In  a  similar  way  a  data  type  "interval"  can  be  defined  and  its  related 
operators  implemented  so  that  rounded  interval  algebra  is  performed  to 
any  precision.  The  realisation  of  the  potential  of  such  computer 
algebras  led  to  the  award  of  further  funding  from  the  NAB  research 
initiative. 

The  current  state  of  the  software  developed  under  that  initiative  is 
described  in  Bartholomew-Biggs  (1990).  which  includes  the  calculation 
of  accurate  dot  products  and  of  matrix  vector  products  accurate  to  the 
last  significant  place  in  the  mantissa. 

Once  the  concept  of  defining  new  algebras  in  ADA  became  accepted  we 
realised  that  it  was  possible  to  define  the  algebra  of  doublets  and 
triplets  to  calculate  the  gradients  and  hessians  of  objective 
functions  and  constraints.  Our  implementation  is  outlined  in  Section 
2  where  it  is  demonstrated  that  especially  for  partially  separable 
functions  this  gives  an  accurate  and  fast  method  of  calculating  all 
such  gradient  vectors.  Ve  can  regularly  obtain  the  Jacobian  of  a  5000 
X  5000  set  of  O.D.E.s  in  less  than  50  times  the  cost  of  a  function 
evaluation. 

In  our  first  interim  report,  Dixon  (1987),  a  review  was  given  of 
parallel  methods  for  solving  sets  of  linear  equations  and  their 
application  within  optimisation  algorithms. 

In  that  report  two  different  approaches  to  the  design  of  parallel 
optimisation  software  were  identified. 

Approach  A 

The  calculation  of  each  objective  function  value  F(x)  is  divided  into 
P  parallel  tasks.  An  approach  which  leaves  the  responsibility  for  the 
efficient  use  of  parallelism  in  the  hands  of  the  user,  and. 

Approach  B 

An  algorithm  is  designed  so  that  it  can  accept  P  values  of  F(x)  or 
VF(x)  simultaneously.  An  approach  which  leaves  the  responsibility  for 
the  efficient  use  of  parallelism  in  the  hands  of  the  user. 
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Algorithms  for  both  approaches  are  described  in  Section  3. 

Our  experience  using  Interval  methods  to  solve  sets  of  linear 
equations  vas  very  disappointing  (Parkhurst  (1987))  and  after 
performing  the  comparison  between  direct  and  indirect  methods  given  in 
Tables  28  and  29  of  Section  5  (Dixon  &  Maany  (1987))  ve  tend  to  use 
preconditioned  conjugate  gradient  algorithms  for  symmetric  matrices 
and  after  considerable  additional  testing  (Parkhurst  (1990)),  the 
preconditioned  CGS  method  (Sonneveld(1986))  for  nonsymmetric  matrices. 

In  our  third  interim  report  (Dixon  &  Haany  (1988))  our  implementation 
of  a  truncated  Newton  code  using  Automatic  differentiation,  and  matrix 
and  vector  packages  in  ADA  was  compared  with  a  FORTRANN  implementation 
of  the  same  nominal  algorithm.  This  demonstrated  that  our  ADA 
implementation  was  only  three  times  more  expensive  than  the  FORTRANN, 
even  on  problems  of  3000  dimensions. 

In  the  fourth  interim  report,  Dixon,  Maany  and  Hohseninia  (1989a), 
further  results  were  given  for  the  sequential  ADA  code  and  the  first 
results  for  the  parallel  implementation  on  the  Sequent  Balance.  This 
is  also  discussed  in  Section  3. 

The  fifth  interim  report  contained  three  papers  presented  at  the  IFAC 
Symposium  on  "Dynamic  Modelling  and  Control  of  National  Economies". 

In  the  first  of  these  Dixon,  Maany  and  Mohseninia  (1989B)  further 
developed  the  algebras  of  automatic  differentiation  and  their 
inclusion  in  an  unconstrained  optimization  algorithm.  In  the  second 
Bartholomew-Biggs  (1989)  gave  the  first  presentation  of  the  use  of 
automatic  differentiation  in  a  constrained  optimization  algorithm, 
whilst  in  the  third  Mohseninia  (1989)  described  the  use  of  parallel 
automatic  differentiation  to  solve  Navler  Stokes  equations  by  a  finite 
element  method  on  the  Sequent  Balance  Multi  processor  machine.  These 
results  were  described  in  more  detail  in  the  sixth  Interim  report 
Dixon  &  Mohseninia  (1990)  in  which  a  closer  examination  of  the  results 
indicated  that  whilst  the  parallel  automatic  differentiation  was 
performing  effectively  the  parallel  linear  equation  solver  was 
suffering  due  to  the  dominance  of  communication  costs. 

All  our  experience  indicates  that  the  ratio  of  the  time  required  to 
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pass  a  floating  point  number  between  processors  to  the  time  required 
to  multiply  two  floating  point  numbers  is  a  critical  parameter  in  the 
design  of  parallel  software.  The  codes  described  in  Section  3  on  both 
the  Sequent  Balance  and  the  transputer  net  would  perform  much  better 
if  the  communication  speed  had  been  ten  times  faster. 

Section  2  of  this  report  is  therefore  devoted  to  a  decription  of 
automatic  differentiation  and  Section  3  to  our  experience  using 
unconstrained  optimization  algorithms  on  parallel  processing  systems. 
Section  4  is  devoted  to  constrained  optimization,  while  section  5 
contains  some  speculations  on  future  developments. 

Some  conclusions  are  given  in  Section  6. 

2.  The  calculation  of  derivatives 


2.1  Gradient  and  Hessian  of  a  function  f(x) 


If  we  return  first  to  the  unconstrained  optimisation  problem  (1.1);  we 
will  assume  that  the  objective  function  f(x)  is  computable  and  that  it 
can  be  expanded  as  a  series  of  elementary  operations  of  one  or  two 
variables 


F,(Xj,x,). 

So  the  objective  function  is  given  by 
For  i  «  n+l,  . . . ,  n-t-M 

next  i. 


The  elementary  operations  consist  of  addition,  subtraction  and 
multiplication  of  two  variables  supplemented  by  functions  of  one 


-  5  - 


variable  i.e.  sine,  cosine,  log,  exponential,  pover,  reciprocal  etc. 

Any  computable  objective  function  can  be  expanded  into  such  a  form  if 
Boolian  switches  are  ignored.  For  the  purpose  of  this  paper  such 
switches  will  be  ignored. 

If  we  consider  the  well  known  Rosenbrock  function 

f(x)  -  100(x/-  Xj)*+  (1-Xj)*  (2.2) 

This  could  be  expanded  as 


f  -  X,  -  Xj+  X, 


So  we  see  that  this  simple  two  dimensional  problem  contains  seven 
operations.  It  is  not  unusual  that  for  many  practical  problems  M>n^  or 
even  n^ .  This  example  also  indicates  that  the  order  of  the  operations 
is  often  flexible  and  that  the  calculation  may  be  indicated  by  a  graph, 
which  connects  each  operation  with  its  data  and  emphasises  any  parallel 
computation  possibilities  in  its  calculation. 


Figure  1.  Graph  of  Rosenbrocks  function 


This  graph  emphasizes  that  the  operations  labelled  7  and  8  could  have 
been  undertaken  at  any  stage  in  the  calculation  without  altering  the 
result.  It  also  indicates  that  at  most  two  processors  could  be  used 
and  that  at  least  five  sequential  steps  are  required. 

If  we  now  consider  any  least  squares  problem 

K 

f(x)  -  j  (2.4) 

k.l 

then  each  subfunction  s^(x)  could  have  a  graph  like  Rosenbrock's 
function  and  the  subfunctions  could  be  performed  as  independent 
calculations  and  the  graph  only  linked  at  the  end. 


Figure  2 

Any  function  with  a  number  of  parallel  subtrees  will  be  termed 
partially  separable  following  Griewank  and  Toint  (1981).  Frequently 
but  not  necessarily  each  of  the  subfunctions  s^  will  depend  upon  a 
relatively  small  number  of  the  independent  variables  i>l,  ...,  n. 

Now  suppose  we  wish  to  find  the  gradient  vector  Vi.  How  should  we 
proceed?  There  are  at  least  four  alternative  approaches. 
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Ml  Analytic  Differentiation;  The  formula  for  the  gradient  vector  are 
derived  analytically  and  entered  into  the  computer.  This  vas  the 
method  usually  adopted  by  practitioners  until  quite  recently.  For 
Rosenbrock's  function  this  is  fairly  simple  as  n«2  and  the  gradient 
vector  is  given  by 


7f 


r  400  x^<x^'-  x.  )  -  2(1-Xj)  'I 


[  -200(x^^-  Xj) 


(2.5) 


but  as  n  and  M  increase  this  becomes  increasingly  tedious  and  mistakes 
are  liable  to  occur. 


M2  Numerical  Approximation;  One  sided  or  central  difference  formula 
may  be  used  to  estimate  the  derivatives  i.e.  the  approximation 

=  (f(x+ha.)  -  f(x))/h  (2.6) 

could  be  adopted. 

This  is  of  course  simple  to  program  and  is  often  successfully  used,  it 
is  however  expensive  as  calculating  7f  now  requires  at  least  nM 
operations  and  is  also  dependent  on  choosing  a  suitable  value  of  h. 


M3  Symbolic  Differentiation;  Many  mathematical  aids  such  as  MACSYMA 
will  accept  a  FORTRAN  listing  of  an  objective  function  and  produce  a 
FORTRAN  listing  of  the  gradient.  The  difficulty  of  this  approach  is 
that  such  symbolic  codes  often  produce  inefficient  FORTRAN  and  to  date 
are  often  very  restrictive  on  the  complexity  (M)  of  the  problem  that 
can  be  tackled.  Grievank(1988)  gives  the  listing  of  the  gradient  of 
the  Helmholtz  energy  function  given  by  MACSYMA.  Such  a  listing  then 
has  to  be  evaluated  at  each  iterative  point  and  when  the  formula 

is  lengthy  it  is  not  clear  that  this  has  any  advantages  over  automatic 
differentiation. 
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M4  Automatic  Differentiation;  A  full  description  of  automatic 
differentiation  is  given  in  Rall(1981).  Automatic  differentiation  is 
essentially  the  computer  implementation  of  a  new  algebra,  vhich  is 
implemented  by  defining  a  new  data  type  and  overwriting  the  meaning  of 
operators  for  that  data  type.  The  simplest  such  data  type,  the 
doublet,  consists  of  the  n+1  numbers  U-(u,Vu).  Then  given  one  or  two 
doublets  the  meaning  of  the  elementary  operations  is  overwritten  so,  we 
obtain; - 


V 

V 

V 

V 

V 

V 


u+v 

->  (u+v, 

Tu+W) 

u-v 

(u-v, 

Tu-W) 

u*v 

-*  (u*v, 

uW+vTu) 

u* 

(uS 

2uTu) 

u-^ 

->  (u“^. 

-u'^TU) 

logU 

(logu 

,  u”^Tu). 

j 


(2.7) 


As  each  of  the  operations  in  2.1  is  of  this  form  we  can  obtain  the 
derivatives  by  forward  automatic  differentiation  in  terms  of  doublets;- 

For  i  «  n+1,  . . . ,  M+n 

«  F^((x^,9icp,(x^,Vx:^))  j,lc  <  i 

next  i 

This  approach  is  accurate  but  is  bounded  above  by  (2n-fl)M  operations 
due  to  the  fact  that  a  full  vector  7w  is  created  at  each  step  even 
though  W  is  usually  sparse. 


M4.1  Sparse  Doublet  and  Sparse  Triplet  Arithmetic;  An  efficient 
implementation  in  ADA  is  described  in  Dixon,  Maany,  Mohseninia(1989) 
where  the  vector  Tv  is  stored  in  sparse  form  i.e.  (Number  of  nonzero 
elements,  position,  value)  so  Vx^  «  (1,1,1)  even  if  n>1000. 

This  method  can  be  readily  extended  to  the  calculation  of  the  second 
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derivatives  by  introducing  the  algebra  o£  sparse  triplets  in  which 
T  ■>  (u,  Vh,  V^u)  and  both  VU  and  are  stored  in  sparse  form.  Ve 
will  illustrate  this  by  shoving  the  calculation  for  Rosenbrocks 
function  and  then  the  Helmholtz  energy  function.  As  n>2  for 
Rosenbrock's  function  the  illustration  will  ignore  sparsity. 


*1 

(Xj ;  1,0}  0,0,0) 

*2 

(Xj 5  0,1$  0,0,0) 

(Xj*;  2x^,0;  2,0,0) 

*4- 

X3-X, 

(x^;  2X3,-!;  2,0,0) 

X 

III 

R 

*4' 

(Xj}  4X3 x^,  -2x^;  4x^+  8x 

*6- 

100x5 

(  all  multiplied  by  100) 

X,- 

1-x, 

(x^ $  —1,0$  0,0,0) 

*a- 

(x,$  -2x,,0$  +2,0,0) 

X,. 

*6+ 

( x, $  400X3  *4 , -200x^  $ 

400x^  +800Xj  *,+2,200, -400x ^ ) 


which  is  the  correct  solution. 


It  will  be  noted  that  even  on  this  very  simple  problem  19  of  the  34 
entries  in  the  9  triplets  calculated  are  zero. 


The  Helmholtz  energy  function  introduced  by  Grievank(1988)  as  a 
standard  example  for  automatic  differentiation  is  given  by:- 


f 


n 


RT 


i-1 


l-b^'x 


x*Ax  l+(l+/2)b’’x 

- -  log  - —  . 

✓&b^x  l+(l-/2)b^x 


(2.8) 


This  can  be  divided  into  certain  subfunctions,  and  in  Table  1  ve  list 
the  number  of  operations  involved  in  the  calculation  of  f,  Vf  and  7* f 
in  sparse  doublet  and  sparse  triplet  arithmetic. 
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Table  1.1 

From  the  Table  1.1  ve  can  see  that  very  many  terms  contribute  O(n^) 
operations  to  but  that  the  term  x^Ax  dominates  F,  whilst  the  term 
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^  X,  log  x,/(l-b^x) 
i-1 

dominates  and  appears  to  make  it  an  O(n^)  operation.  As  Haany 
observed  when  performing  our  tests  in  ADA,  this  term  can  be  rearranged 
as 


n 


2  X  log  X  /(1-b  X)  -  51  X  (log  X  -  log(l-b  x)) 
i-l  i-1  ^ 


'  n  ^  I’  n  ' 

I  X  log  x^  -  2  X 

U-1  J  U-1  J 


and  vhen  performing  these  operations  in  this  order  ve  have 
following  operations: 


log  (1-b’x) 
the 


log  Xj 

log 

1 

1 

x^  log  x^ 

1 

1 

1 

So  Jx^  log  x^ 

n 

n 

n 

n 

n 

0 

log(l-b’^x) 

log 

2n 

3n* 

^5IxJlog(l-b*x) 

1 

3n 

6n* 

So  the  sum  becomes 

(n-(-l)log  +  3n 

9n 

9n*  +  3n 

&  the  overall  total 

(n+l)log  +  3n* 

4n*+  26n 

31n* 

Table  1.2.  Revised  operations  count 


Ve  note  that  the  n^  term  in  the  calculation  of  V^F  no  longer  occurs  and 
all  three  calculations  are  now  dominated  by  order  n^  terms. 

Ve  may  first  conclude  from  this  that  the  order  in  which  a  function 
calculation  is  performed  can  dramatically  effect  the  cost  of 
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calculating  the  derivatives  by  sparse  doublet  and  sparse  triplet 
methods. 

Ve  may  note  too  that  as  reported  by  Dixon,  Maany  and  Mohseninia  (1989) 
and  shown  below,  the  times  taken  by  the  ADA  implementation  confirmed 
the  constant  nature  of  the  ratios  of  the  number  of  operations  required 
to  evaluate  f,  Vf  and  V^f;  the  difference  in  the  actual  ratios 
obtained  do  however  imply  that  the  cost  of  the  index  operations  in  the 
xink  lists  cannot  be  ignored. 


Dimension 

Sparse 

Doublet 

Sparse 

Triplet 

5 

1.68 

33 

10 

3.20 

49 

20 

5.27 

46 

30 

5.55 

41 

40 

5.69 

46 

50 

5.95 

48 

60 

6.23 

48 

200 

7.23 

500 

8.15 

Table  2.  The  relative  cost  of  calculating  Vf  by  sparse  doublet  and  7^ f 
by  sparse  triplet  to  the  cost  of  calculating  £. 


M  4.2  Reverse  Automatic  Differentiation;  In  Griewank  (1988)  a  proof 
is  given  that  by  using  reverse  automatic  differentiation  whenever  f  can 
be  calculated  in  M  elementary  operations  then  Vf  can  be  calculated  in 
less  than  5M  elementary  operations  whatever  the  dimension  n. 

Recalling  the  notation  for  calculating  f  given  in  (2.1),  reverse 
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This  reduction  in  the  number  of  operations  required  to  calculate  V£  in 
reverse  rather  than  forward  automatic  differentiation  is  balanced  by 
the  need  to  store  the  computational  graph  so  that  it  is  available  for 
the  reverse  sweep. 

One  of  my  colleagues,  Bruce  Christianson  (1990)  has  written  an 
implementation  of  this  in  ADA  as  a  standard  "subroutine"  compatible 
with  our  ADA  optimisation  codes.  His  results  for  the  Helmholtz  energy 
function  are  given  in  the  following  table  3.  The  method  can  be  readily 
extended  to  obtain  the  product  V^f.u  for  any  given  direction  and  these 
results  are  also  given 
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Helmholz  Energy 
Function 

f 

ordinary 

f 

by  graph 

Vi 

by  graph 

V^fu 

by  graph 

o 

CM 

N 

C 

- 

7 

8 

16 

50 

11 

53 

52 

120 

100 

40 

320 

300 

660 

Times  in  Seconds 

Table  3  Reverse  Automatic  Differentiation 


2.2  Jacobian  of  a  vector  function  s^  (x) 


As  veil  as  needing  to  calculate  the  gradient  and  Hessian  of  a  scalar 
function  f(x)  ve  often  need  to  calculate  the  Jacobian  of  a  vector 
function  s^^.  This  occurs  in  constrained  optimisation  where  could  be 
the  vector  of  active  constraints t  in  ordinary  differential  equations 
or  in  the  least  squares  minimisation  of 

k 

£(x)  -  j  (x). 
k-1 

When  each  s^  is  a  separate  calculation  then  ve  can  use  either  sparse 
doublet  or  reverse  automatic  differentiation  to  efficiently  calculate 
the  gradient  of  each  s^.  The  sparse  doublet  form  is  particularly 
effective  when  each  s^  only  depends  on  a  fev  variables  (i.e.  the 
Jacobian  is  sparse). 

This  is  the  case  for  the  Lokta-Volterra  predator  prey  function 
described  in  Byrne  and  Hindmarsh  (1987). 

Using  the  sparse  doublet  implementation  in  ADA,  Parkhurst  obtained  the 
following  results: 

Evaluation  of  (f,  J)  using  Sparse  Doublets 

-  - 

Evaluation  of  (f)  using  ordinary  arithmetic 
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Evaluation  of  (£)  using  Sparse  Doublets 
Evaluation  of  (f)  using  ordinary  arithmetic 


r 


Dimension 

50 

200 

BOO 

1800 

3200 

5000 

52.39 

54.10 

51.14 

50.42 

49.67 

50.19 

>^2 

5.05 

4.74 

4.73 

4.75 

4.72 

4.77 

Table  4.  Performance  of  Sparse  Doublet  Automatic  Differentiation 


Preliminary  results  with  Christianson's  implementation  of  reverse 
automatic  differentiation  indicate  that  this  implementation  gives  a 
value  of  r^  of  approximately  19.  If  ve  assume  that  the  s,^  are  separate 
calculations  and  that  the  total  number  of  operations  in  calculating  S|^ 
is  M^,  then  the  total  operations  for  calculating  7s^  is  qM^  (vhere  q  is 
theoretically  less  than  5  for  reverse  automatic  differentiation). 

Nov  as  M  ■  then  the  total  number  of  operations  for  J  <  qM^«  qM. 

However,  assuming  that  the  s^  are  separate  calculations  is  not  always 
fair  as  frequently  there  will  be  common  strands  in  the  calculations. 

If  ve  drew  a  computational  tree  for  the  calculation  of  s,^  such  common 
strands  must  commence  at  the  start  of  the  tree  and  will  probably 
involve  a  subset  NEL^  of  the  variables  and  lead  into  a  subset  of  the 
subfunctions  and  will  involve  elementary  operations.  Let  us  assume 
the  common  strand  ends  with  a  value  x^.  There  can  of  course  be  more 
than  one  common  strand. 

In  sparse  doublet  arithmetic  ve  vould  proceed  forvard  through  this 
common  strand  once  and  vould  use  less  than  q  x  NEL^  x  operations  to 
obtain  Vx^. 


In  reverse  automatic  differentiation  ve  vould  reverse  through  the 
common  strand  times  and  vould  thus  appear  to  possibly  need  q  x  x 
operations.  This  destroys  the  simple  bound  relating  the  operations 
involved  in  the  calculation  of  s  with  those  required  by  the  calculation 
of  J. 
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Hovever  suppose  nov  ve  identified  the  common  strand,  and  it  would  have 
to  be  identified  if  it  is  to  be  utilised  in  the  calculation  of  s,  then 
ve  could  start  reverse  differentiation  at  the  end  of  the  strand  and 
calculate  by  one  pass  in  q  operations. 

Nov  suppose  ve  calculate  back  in  reverse  mode  for  the  gradient  of  s.  ve 

- 

will  reach  a  value  i>c  and  have  ~  ax  update  those  i<n 

contained  in  NEL^  by  x^+ 

This  linking  operation  involves  (NEL^)(K^)  elementary  operations,  so  ve 
have  replaced  q  operations  by  qM^  +  (NEL^)(K^)  operations.  There 

is  therefore  a  saving  provided 

q(K^-l)M^  >  (NEL^)(K^) 

which  will  frequently  be  true. 

Similar  considerations  apply  in  the  calculation  of  the  Hessian  matrix 
of  a  function.  If  ve  consider  the  calculation  of  the  gradient  of 
Rosenbrock's  function  by  reverse  differentiation  it  is  simply 


The  last  step  illustrates  that  not  all  the  operations  generated  by 
reverse  differentiation  are  elementary  and  ve  strictly  need  to  divide 
this  into  two  steps 
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The  combined  tree  for  the  function  and  gradient  calculation  is 
therefore  that  given  on  the  left 


/ 

/ 


I 

I 


s 


Figure  3 

Ve  may  make  the  output  a  scalar,  the  directional  derivative 
Vf*d  ■  Xjdj+  Xjdj 

by  adding  an  extra  node  at  the  bottom  and  then  apply  reverse 


differentiation  again  to  obtain  V^fd.  However  we  now  need  to  reverse 

back,  from  ,  x,  to  x^ ,  x^  and  ve  note  that  this  can  only  be  done 

through  the  dotted  lines  linking  the  two  graphs  i.e.  the  nonlinear 

operations.  This  observation  allows  us  to  drastically  reduce  the  tree 

as  shown  on  the  right  and  even  now  we  have  the  common  subtree  in  the 
calculation  of  x^,  x^  namely  that  from  x^^ ,  x,  **  .  Ve  may  note  too 

that  the  graph  is  symmetric  about  a  horizontal  line  passing  through  the 

nonlinearities.  This  is  a  general  result  for  all  such  "Hessian" 

graphs.  Whether  this  approach  to  the  Hessian  calculation  has  any 

advantage  over  that  described  and  implemented  by  Christianson  still 

needs  to  be  determined. 


2.3  Conclusions 


In  this  section  we  have  demonstrated  that  two  automatic  methods  now 
enable  us  to  compute  partial  derivatives  efficiently.  In  particular 
the  sparse  doublet/sparse  triplet  approach  gives  efficient  methods  for 
calculating  the  gradient  and  Hessian  of  partially  separable  scalar 
functions  and  of  the  Jacobian  of  vector  functions. 

The  reverse  differentiation  approach  requires  the  storage  of  the 
computational  tree  but  then  provides  a  very  efficient  way  of 
calculating  the  gradient  of  any  scalar  function  and  the  Jacobian  of 
vector  functions  that  do  not  contain  significant  subtrees.  The  method 
also  provides  a  very  efficient  means  of  calculating  V^f  u  for  given  u. 

A  method  of  handling  common  subtrees  in  reverse  differentiation  has 
been  suggested  above  which  should  be  used  whenever  qM^  >  NEL^. 

Some  interesting  conjectures  have  been  made  on  the  reduction  of  the  ■ 
computational  trees  for  Hessian  calculations. 


The  contrasting  values  for  sparse  forward  and  backward  calculation  are 


Reverse 

Sparse  Forward 

9f/f 

q 

NEL 

7^f/f 

q* 

¥iNEL(NEL-t-l) 

J/f 

qk* 

NEL 

V*fu/f 

2q  (ref[13I) 

2. NEL. 

q  <  5 


where  NEL  is  the  average  number  of  nonzeros  in  the  gradient  vector 
i  >  n  during  the  calculation,  if  the  function  is  partially  separable 
NEL  is  less  than  the  number  of  variables  in  each  subfunction. 

*  Ve  have  however  shown  that  a  better  bound  for  J/f  by  reverse 
automatic  differentiation  exists. 


3.  The  effect  of  Parallel  Computation 


3.1  Experience  using  the  ICL/DAP 

At  Hatfield  Polytechnic  our  investigation  into  the  benefits  of  using 
parallel  computation  commenced  with  the  study  of  Kanu  Patel  who 
implemented  unconstrained  optimisation  algorithms  on  two  very  different 
parallel  processing  machines  -  the  Neptune  machine  at  Loughborough 
University  and  the  ICL-DAP  at  Queen  Mary  College.  The  ICL-DAP  was  an 
SIMD  machine  with  4096  processors,  the  Neptune  an  MIHD  machine  with 
four  processors.  His  work  on  the  Neptune  was  mainly  concerned  with  the 
solution  of  small  dimensional  multi-extremal  problems  and  is  not 
particularly  relevant  to  this  paper. 

As  the  DAP  is  an  SIMD  machine,  each  of  the  4096  processors  must  either 
perform  an  identical  arithmetic  operation  or  be  turned  off.  On  this 
basis  he  chose  to  perform  function  evaluations  in  parallel  and  to  then 
approximate  the  gradient  and  Hessian  by  difference  formula,  and  thus 
implemented  a  parallel  modified  Newton  algorithm,  Patel  (1982). 
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3.1.1  The  Modified  Nevton  Algoritha 


Patel's  Parallel  Modified  Newton  Algorithm 

Step  1  Initial  guess  x*®*,  step  h,  tolerance  e 

Step  2  Calculate  f(x  ±  ha^  ±  ha^)  all  j  >  i  where  a^^  is  a  unit 

vector  along  i^*'  axis.  This  is  an  obvious  parallel 
calculation. 

Step  3  Calculate  V£  and  £  by  central  differences.  He  shoved  this 
could  be  performed  in  parallel  on  an  SIHO  machine. 

Step  4  Solve  the  set  of  linear  equations 

(T^f  +  ul)p  -  -VI  (3.1) 

Be  used  the  standard  DAP  equation  solver. 

Step  5  Perform  a  A096  grid  search  in  the  two  dimensional  space 

spanned  by  p  and  -Vf  which  sensibly  replaces  the  line  search 
given  that  4096  evaluations  take  the  same  time  as  1. 

Step  6  Arg  min  f(x^)  i  «  1,  ...,  4096 

using  the  very  fast  DAP  "min"  operation 

Step  7  Return  to  2  with  k  -  k+1. 

Essentially  this  algorithm  uses  P  -  processors  and  calculated  f,  Vf, 
V^ f  in  M  steps  and  solved  the  Nevton  equations  in  0(n*/P)  steps. 

At  the  time  when  the  relative  speed  of  calculating  4096  parallel 
function  evaluations  on  the  DAP  was  20  times  faster  than  evaluating 
them  sequentially  on  a  DEC  1091,  the  comparison  times  to  solve  five  64 
dimensional  problems  were  obtained.  Standard  Modified  Nevton,  Variable 
metric  and  Conjugate  gradient  codes  were  run  on  the  DEC  1091  and  the 
parallel  Nevton  on  the  DAP. 
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Problem 

Sequential 

Nevton 

Variable 

Metric 

Conjugate 

Gradient 

Parallel 

Nevton 

Quadratic 

16.60 

2.10 

1.18 

0.966 

Extended  Rosenbrock 

140.98 

80.78 

11.36 

36.626 

Extended  Povell 

134.97 

41.28 

11.06 

1.500 

Trigonometric 

2604 

455 

78.68 

17.57 

Extended  Box 

7196 

1263 

354.0 

202.37 

Table  5  Performance  of  the  Parallel  Nevton  Method 


From  these  results  it  is  easily  seen  that  the  parallel  Nevton  algorithm 
greatly  outperformed  the  sequential  Nevton  code,  often  by  far  more  than 
the  natural  value  of  20.  Hovever,  it  vas  also  obvious  that  on  these 
functions  the  conjugate  gradient  code  also  outperformed  the  sequential 
Nevton  code  and  that  a  comparison  of  its  times  vith  those  of  the 
parallel  code  vas  not  particularly  encouraging. 

Tvo  decisions  vere  made  at  that  time  vhich  have  greatly  influenced  our 
subsequent  development.  The  first  vas  that  as  the  functions  ve  vere 
interested  in  at  that  time  possessed  considerable  structure  this  should 
be  used  in  the  evaluation  of  f  and  in  the  solution  of  the  set  of 
equations.  The  second  vas  that  ve  needed  to  consider  conjugate 
gradient  type  algorithms  and  problems  vith  dimensions  greater  than  64. 


3.1.2  The  parallel  conjugate  gradient  algorithm 

The  first  problem  studied  vas  a  quadratic  function  derived  from  Stones  « 

set  of  heat  conduction  problems  namely:- 

‘  (ff  *  ■'v  (f)’- 

To  solve  this  problem  Ducksbury  (1984)  split  it  into  rectangular 
elements  and  approximated  T  in  each  element  by  the  standard  bi-linear 
mapping. 
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The  function  can  now  be  expressed  in  the  form 


f 


K 

I 


elements 


s 


•  1 


vhere 

*..  ■  If  (f A  ^ 

element 

and  s^^  is  a  function  of  4  variables,  the  values  of  T  at  the  corners  of 
the  elements. 

If  is  the  number  of  operations  needed  to  evaluate  s^^,  then  the 
total  operation  count  for  f  is  simply  C  vhere  C  is  the  number  of 
elements  and  is  slightly  less  than  n. 

Again  Vf  -  7s  , 
el 


el 

vhere  7s^^  only  has  four  nonzeros  and  16  nonzeros. 

The  resulting  set  of  linear  equations  vas  solved  on  the  DAP  using  the 
conjugate  gradient  algorithm.  The  decision  vas  taken  to  allocate  each 
element  to  its  natural  processor  on  the  DAP  grid  and  to  distribute  the 
search  direction  p  over  the  processors  so  each  processor  held  the  four 
components  that  influence  its  value  of  f.  Given  the  conjugate  gradient 
formula  this  only  required  data  passing  betveen  neighbouring  processors 
thus  reducing  the  communication  costs,  the  conjugate  gradient  code  then 
only  needs  the  calculation  of 


y 


P  ~  I  V's.j  p,j 
el 


(3.4) 
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+  DEC  -  1091 
*  ICL  -  DAP 


Figure  4.1  Performance  of  the  Conjugate  Gradient  Codes 


and  for  each  element/processor  this  is  simply  16  operations.  In 
practice  the  scalar  products  r^r,  y'^p,  p*p  were  also  calculated  using 
partial  vectors  distributed  over  the  processors. 

This  problem  is  equivalent  to  the  heat  conduction  problem  tested  by 
Stone  vho  introduced  a  number  of  different  cases  based  on  the 
distribution  of  K  and  K  . 

X  y 

The  simplest  case  with  K  >  K  >  1  leads  to  a  veil  conditioned  problem 

X  Jr 

and  the  performance  of  the  conjugate  gradient  codes  on  this  problem  are 
shown  in  Figure  4.1,  where  the  log  of  the  CPU  time  is  plotted  against 
the  grid  size. 
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Other  distributions  lead  to  less  veil-conditioned  problems  but  the 
relative  performance  of  the  sequential  and  parallel  codes  is  unaltered 
as  shown  in  Figure  4.2. 


★  ICL-DAP 
+  DEC-1091 


Figure  4.2  Performance  of  Sequential  and  Parallel  Codes 


The  time  required  on  the  sequential  machine  precluded  running  the  test 
for  all  the  grid  sizes,  but  there  was  a  speed-up  of  over  100  even  when 
only  using  a  small  part  of  the  DAP. 


This  work  vas  extended  to  the  nonlinear  domain  by  considering  the 
differential  equation 


(3.5) 


ilil  +  .  Ru  in 

ax^  3y^  ax 


0 


with  boundary  values  such  that  the  analytic  solution  vas 


6x, 


Rx, 


and  converting  it  to  a  three  dimensional  map  by  Introducing 


a 


au 

3y 


and  minimising 


Again  the  function  and  gradient  evaluations  were  distributed  over 
rectangular  elements  so  that  for  a  64  x  64  grid  there  were  3  x  4096 
unknowns  before  the  specification  of  the  boundary  condition.  This 
implies  that  the  Hessian  matrix  is  now  a  combination  of  12  x  12  blocks. 
Again  a  parallel  version  of  the  Fletcher  Reeves  algorithm  vas  used. 

A  typical  result  is  shown  in  figure  5(overpage),  where  the  number  of 
processors  used  P  *  (grid  size)^.  It  appears  that  the 
function/gradient  cost  is  proportional  to  the  (grid  size)^  sequentially 
and  is  therefore  constant  for  the  parallel  machine  and  that  the  number 
of  iterations  is  approximately  linear  in  the  grid  size. 
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So  the  sequential  CPU  a  (grid  size)^ 
parallel  CPU  a  (grid  size). 


*  ICL  DAP 
+  DEC  1091 


Figure  5  Performance  of  Parallel  and  Sequence  Conjugate  Gradient  Codes 


3.1.3  The  Truncated  Nevton  Algorithm 


Before  1983  most  existing  optimisation  routines  that  used  the  Nevtons 
equation 

r  -  V*f(x'‘)d  +  7f(x’‘)  -  0 

solved  the  equation  as  accurately  as  possible  even  though  the  current 
iterate  vas  far  from  the  optimal  solution.  For  large  problems  a  major 
part  of  the  computing  cost  can  be  attributed  to  the  solution  of  these 
equations. 
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The  fact  that  it  is  not  necessary  to  solve  these  equations  accurately 
vhen  far  from  the  solution,  vas  reported  by  Dembo,  Eisenstat  and 
Steihaug  (1982,  83).  They  introduced  the  idea  of  calculating  an 
approximate  solution  vhen  far  from  the  solution  and  suggested 
solvingthe  equations  by  the  conjugate  gradient  method  and  also 
truncating  the  iteration  before  an  accurate  solution  is  obtained. 

In  our  implementations  at  Hatfield  this  inner  conjugate  gradient 
iteration  vas  terminated  if 


(1) 


•j  "i 


*  min 


vhere  Vf  is  calculated  at  the  iterate  of  the  outer  iteration  and  r^ 
at  the  j^**  iterate  of  the  inner  iteration  or 


a  trust  region  imposed  to  limit  the  size  of  the  step  taken  in  the  inner 
iteration 


The  conjugate  gradient  method  has  the  property  that 


so  this  limit  vill  be  reached  and  terminate  the  inner  iteration  unless 
the  solution  is  vithin  the  trust  region 


or; 

(iii)  d^*Vf  >  -.llidjil  llVfll 

vhich  ensures  that  the  resultant  direction  satisfies  Wolfes  first 
condition  for  finite  termination. 

These  conditions  vhen  combined  vith  an  Armijo  style  line  search  that 
satisfies  Wolfes  Conditions  II  and  III  ensure  that  the  truncated  Nevton 
method  has  finite  termination  to  the  neighbourhood  of  a  stationary 
point. 
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Code 

Problem 

N 

No.F 

No.G 

EFE's 

C.P.U. 

TRUNEW 

Wood 

4 

103 

229 

1019 

0.48 

OPCG 

Wood 

4 

384 

155 

1004 

0.80 

OPVM 

Wood 

4 

128 

77 

436 

0.90 

E04KDF 

Wood 

4 

54 

202 

862 

0.42 

TRUNEW 

Ext.  Powell 

4 

39 

82 

367 

0.28 

OPCG 

Ext.  Powell 

4 

96 

40 

256 

0.45 

OPVM 

Ext.  Powell 

4 

9 

31 

173 

0.46 

E04KOF 

Ext.  Powell 

4 

18 

90 

378 

0.31 

TRUNEW 

Ext.  Powell 

60 

33 

90 

5433 

1.27 

OPCG 

Ext.  Powell 

60 

117 

48 

2997 

1.87 

OPVM 

Ext.  Powell 

60 

341 

194 

11981 

59.85 

E04KDF 

Ext.  Powell 

60 

42 

1302 

78162 

15.31 

TRUNEW 

Ext.  Powell 

80 

31 

95 

7631 

1.81 

OPCG 

Ext.  Powell 

80 

216 

93 

7656 

3.79 

OPVM 

Ext.  Powell 

80 

439 

251 

20519 

134.71 

E04KDF 

Ext.  Powell 

80 

43 

1643 

131483 

32.84 

TRUNEW 

Ext.  Rosenbrock 

■1 

41 

59 

631 

0.34 

OPCG 

Ext.  Rosenbrock 

■El 

99 

33 

429 

0.54 

OPVM 

Ext.  Rosenbrock 

133 

83 

963 

2.00 

E04ICDP 

Ext.  Rosenbrock 

■i 

41 

181 

1851 

0.53 

TRUNEW 

Ext.  Rosenbrock 

20 

49 

71 

1469 

0.56 

OPCG 

Ext.  Rosenbrock 

20 

139 

40 

939 

1.00 

OPVM 

Ext.  Rosenbrock 

20 

198 

113 

2447 

6.27 

E04KDF 

Ext.  Rosenbrock 

20 

52 

312 

6292 

1.21 

TRUNEW 

Ext.  Dixon 

80 

21 

148 

11861 

2.16 

OPCG 

Ext.  Dixon 

80 

1.36 

87 

7146 

3.04 

OPVM 

Ext.  Dixon 

80 

201 

102 

8361 

31.14 

E04KDF 

Ext.  Dixon 

80 

28 

929 

66268 

13.54 

TRUNEW 

Ext.  Powell 

2000 

41 

109 

218041 

24.98 

OPCG 

Ext.  Powell 

2000 

125 

49 

98125 

50.13 

TRUNEW 

Ext.  Dixon 

2000 

29 

851 

1702029 

206.42 

OPCG 

Ext.  Dixon 

2000 

1064 

520 

1041064 

393.47 

Table  6  Comparison  of  the  truncated  Nevton  code  (TRUNEV)  with  a 


ugate  gradient  (OPCG),  variable  metric  (OPVM)  and  modified  Nevton 


code  fEOAiSFr 

Our  first  results  comparing  this  algorithm  with  the  standard 
optimisation  codes  were  reported  in  Dixon  &  Price  (1986,  88)  and 
confirmed  the  evidence  reported  by  Dembo,  Eisenstat  and  Steihaug  that 
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this  method  vas  much  more  efficient  than  Modified  Nevton,  Variable 
Metric  or  Conjugate  Gradient  codes  over  a  vide  range  of  dimensions. 

For  convenience  these  results  are  shown  in  Table  6. 

In  this  code  the  matrix  vector  product 

V^fp 

required  in  the  truncated  Nevton  method,  vas  obtained  by  a  difference 
approximation 

V^fp  -  (Vf(x+hp)  -  Vf(x))/h 
as  suggested  by  Dembo  et  al. 

Ducksbury  (1984)  then  implemented  a  parallel  version  of  the 
Dembo-Steihaug  Truncated  Nevton  algorithm  on  the  DAP  and  compared  the 
relative  performance  of  the  parallel  conjugate  gradient  algorithm  with 
the  parallel  truncated  Nevton  method  on  a  large  number  of  problem 
including  ones  based  on  the  partial  differential  equation  (3.5). 

He  noted  that  the  Truncated  Nevton  method  consistently  outperformed  the 
conjugate  gradient  algorithm. 

A  typical  result  is  one  in  which  the  sequential  conjugate  gradient  code 
required  in  excess  of  one  hour  to  solve  a  39  x  39  grid  (4111  unknowns) 
and  the  parallel  conjugate  gradient  code  on  the  DAP  Just  34  secs,  this 
being  a  speed  up  of  104  over  the  DEC  1091  at  a  point  where  only  3/8  of 
the  processors  of  the  DAP  were  in  use. 

For  a  finer  grid  of  64  x  64  processors  (12036  unknowns)  the  conjugate 
gradient  code  on  the  DAP  required  50.76  secs  while  the  truncated  Nevton 
method  only  required  13.21  secs. 

Turning  nov  to  the  Navier  Stokes  problem, 


+  V 


.au 

3x, 
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9v  3v  ^  1  ^ 


3u  3v 
axj  axj  “  ° 

This  was  converted  into  a  set  of  6  first  order  equations  by  introducing 
the  fields 


u  3u  du 


d  .  V  i  e  .  ,  f  .  ^^and  p 


and  noting  that  as  the  continuity  equation  is  simply  b  -t-  f  -  0  the 
variable  f  can  be  eliminated. 

The  equations  become 


e,  .  ab  .  cd  -  i  .  gj  . 


ae  +  df 


1  fik  Is.  1  iL 

R  l3Xj  3XjJ  *  3Xj 


b  +  f 


_  3a  . 

®«  *  -ST  - 


3a 


a  .  M  .  e 

6  3Xj  * 


e  -  -5^  -  f 

’  3Xj 

and  the  objective  function  is 


I  -  Min 


e/dA 
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The  miniDization  is  carried  out  over  the  field  variables  and  as  the 
integrand  only  contain  first  derivatives  the  theory  of  the  calculus  of 
variables  provides  natural  boundary  conditions  that  vill  be  satisfied 
on  the  boundary  if  any  of  the  seven  fields  are  not  fully  specified. 

As  ve  now  have  seven  unknowns  at  each  node  point  the  storage 
requirement  for  for  each  finite  element  Increases  to  28  x  28  for 

the  truncated  Newton  method  and  this  was  not  available  on  the  DAP  when 
these  tests  were  iun. 


3x3 

5x3 

9x9 

17  X  17 

Value  of  I 

0.2037 

0.080917 

0.03533 

0.01730 

CPU 

2:56 

5:39 

14:11 

13:51 

Total  CPU 

36:37 

Iterations 

82 

164 

417 

389 

Table  7  Performance  on  the  Navier  Stokes  Equation 


3.2  More  thoughts  on  calculating  the  search  direction 


3.2.1  The  Truncated  Newton  Method  with  Automatic  Differentiation 


Our  first  results  obtained  combining  the  concept  of  automatic 
differentiation  with  the  truncated  Newton  algorithm  used  a  crude 
Fortran  implementation  and  were  reported  in  Dixon  &  Price  (1986,89). 

In  that  paper  it  was  noted  that  the  vector  matrix  product  f  p  could 
be  obtained  in  two  ways, 

(1)  Sparse  triplets 

First  form  T^f  and  then  form  the  sparse  matrix  vector  product  T^f p 
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(2)  Structured  triplets 


Modify  the  definition  of  a  triplet  to  be 
(f,  Vf,  7"fp) 

and  alter  the  basic  algebra  appropriately. 

It  is  important  to  notice  that  a  structured  triplet  simply  consists  of 
two  sparse  vectors  and  a  scalar  and  therefore  requires  far  less  store. 
Mohseninia  implemented  both  versions  in  ADA  and  found  that  the 
structured  version  is  much  slower. 


Structured 

Problem 

Dimension 

Sparse  Triplets 

Sparse  Triplets 

Extended 

Rosenbrock 

2 

00 

.62 

20 

41.12 

4.48 

40 

98.61 

11.66 

80 

420.08 

39.64 

Extended 

4 

.75 

.57 

Powell 

20 

27.89 

3.36 

40 

112.3 

9.02 

80 

760.7 

33.42 

Extended 

Dixon 

2 

.19 

.13 

20 

26.06 

2.96 

40 

103.51 

12.6 

80 

546.13 

42.31 

Table  8  Comparison  of  structured  and  sparse  triplet  automatic 
dif  ferentiation'T 

Ve  have  therefore  not  pursued  the  concept  of  structured  automatic 
differentiation  further.  It  does  however  require  far  less  store  when 
this  is  important. 
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3.2.2  Accurate  Arithaetic 


While  these  results  were  fairly  conclusive  in  implying  that  sparse 
triplets  vere  preferable  to  structured  triplets.  They  were  worrying  in 
that  they  implied  that  the  codes  vere  not  behaving  theoretically 
correctly  on  these  extended  functions. 

These  functions  have  the  property  that  as  their  dimensions  increase  the 
number  of  distinct  eigenvalues  remain  constant. 

In  1974  Dixon  had  conjectured  that  for  such  extended  problems  the 
number  of  iterations  of  variable  metric  and  conjugate  gradient 
algorithms  should  be  independent  of  n. 

This  was  proved  in  Spedicato  (1976) 

It  was  however  clearly  not  occurring  in  the  above  results.  The  reason 
appears  to  be  that  while  theoretically  the  value  of  the  scalar  product 

n 

t  Aibi 

i-1 

is  independent  of  the  ordering  of  the  elements  in  most  computer 
languages  this  is  not  true  and  this  breaks  the  symmetry  of  the 
calculations  in  the  variable  metric  algorithm. 

Dave  Hills  (1990)  wrote  a  simple  sort  to  make  the  scalar  product 
independent  of  the  order  and  probably  more  accurate  and  obtained 
results  that  reflected  the  theoretical  result. 

A  typical  result  obtained  using  the  variable  metric  algorithm  on  the 
extended  Powell  problem  is  given  overpage. 
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Dimension 

Number  of  iterations 
using  double  precision 
arithmetic 

Number  of  iterations 
using  sorted  dot 
products 

4 

39 

39 

8 

64 

39 

16 

25 

39 

32 

67 

39 

64 

52 

39 

128 

165 

39 

Table  9.  The  effect  of  accurate  arithmetic 


Due  to  the  result  of  this  experience  I  would  urge  people  to  be  very 
careful  how  they  calculate  scalar  vector  products,  and  matrix  vector 
products  to  ensure  that  the  arithmetic  does  not  destroy  the  symmetric 
structure  of  the  problem. 

It  is  essential  in  the  design  of  BLAS  software  for  general  purpose  use 
that  accuracy  is  not  sacrificed  for  speed. 


3.2.3  Maany^s  Test  set 


To  overcome  the  problems  with  the  extended  functions  so  often  used  to 
test  algorithms  in  large  dimensions  Maany  (1989)  (TR210)  introduced  a 
new  family  of  test  problems. 


f(x)  .  1.0  +  f  0.5  f  i-  I'x, 
i-1  ^ 


n-1  ,  ,  2N 

1=1  i=l 


N  r  -•  nK 
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Here  the  dimension  n  «  3N,  and  the  family  contains  three  parameters  fi, 
«,  K. 


With  K  s  0  the  eigenvalue  pattern  is  similar  to  that  in  an  extended 
family  but  for  other  values  of  K  this  property  is  lost  and  the  family 
gets  more  ill-conditioned  as  K  increases. 

If  H  >  0  the  problem  is  quadratic  and  diagonal,  vhen  H  /  0  the 
diagonals  on  either  side  of  the  main  diagonal  are  introduced  and  5  i 
introduces  four  vide  diagonals. 

Haany  tested  the  truncated  Nevton  code  including  sparse  triplet 
automatic  differential  on  twelve  cases  drawn  from  this  family  for  a 
range  of  dimensions  up  to  n  >  3000. 

His  results  indicated  the  robustness  of  the  truncated  Nevton  code  and 
the  efficiency  of  the  dynamic  data  structure  used  in  ADA. 

Three  sets  of  his  results  on  badly  conditioned  problems  are  detailed 
below: 


No  Preconditioning  n  «  3000 

3 

5 

K 

No.  of  iter. 

CPU  time 

X  in 
AD 

Case  10 

1/16 

1/16 

2 

8498 

19001 

12 

Case  11 

1/8 

1/8 

2 

6482 

15461 

15 

Case  12 

0.26 

0.26 

2 

17211 

38072 

11 

Table  10.1.  The  truncated  Nevton  Method  and  automatic  differentiation 
on  the  Haany  problem. 


These  results  indicated  that  the  percentage  of  time  being  spent  in 
conjugate  gradient  code  dominated  that  spent  in  automatic 
differentiation. 

The  results  were  therefore  repeated  using  a  diagonal  pre-conditioner. 
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Diagonal  Preconditioning  n  -  3000 

£ 

K 

No.  of  iter. 

CPU  time 

X  in 
AD 

Case  10 

1/16 

1/16 

2 

31 

1220 

84 

Case  11 

1/8 

1/8 

2 

31 

1226 

84 

Case  12 

0.26 

0.26 

2 

31 

1318 

84 

Table  10.2.  The  truncated  Nevton  Method  and  Automatic  Differentiation 
on  the  Maany  problem.  Effect  oi  preconditioning. 

These  results  indicated  that  it  was  essential  to  use  a  preconditioner 
but  that  when  doing  so  the  calculation  of  the  gradients  even  with  a 
sparse  automatic  differentiation  routine  could  dominate.  Ve  therefore 
wished  to  go  on  a  parallel  computing  system  to  see  how  these  results 
would  improve. 

Before  doing  so  however  it  is  appropriate  to  mention  the  results  of 
Vespucci  (1990)  who  has  shown  that  if  the  conjugate  gradient  algorithm 
was  replaced  by  that  equivalent  algorithm  from  the 
Abaffy-Broyden-Spedicato  family  which  theoretically  generates  the  same 
sequence  of  points  x** ,  then  the  number  of  outer  and  inner  iterations 
is  greatly  reduced. 

Two  typical  results  one  for  a  40  dimensional  modified  extended  Vood 
function  with  a  complete  set  of  distinct  eigenvalues  and  one  for  a 
member  of  the  Maany  family  with  n  -  150  and  K  >  3  Indicate  this 
effect . 


Maj.  It. 

Min.  It. 

Time 

Modified 

Extended 

CG 

412 

14962 

16.1 

Wood 
n  *  40 

ABS/CG 

137 

3218 

7.5 

Maany 

CG 

222 

30082 

400 

Function 
n  -  150 

ABS/CG 

32 

1519 

47 

Table  11.  The  effect  of  using  the  conjugate  gradient  algorithm  from 
the  ABS  family  on  iterations  and  computer  time 

-  37  - 


These  results  emphasise  the  need  for  accurate  arithmetic  in  CG 
algorithms.  The  results  are  not  strictly  comparable  as  the  ABS/CG 
method  stores  a  full  set  of  n  x  n  matrices  and  it  is  indeed 
remarkable  that  the  cost  of  updating  such  a  matrix  is  dominated  by  the 
effect  of  rounding  error  in  increasing  the  number  of  major  and  minor 
iterations. 

In  her  paper  she  advocates  replacing  the  conjugate  gradient  algorithm 
by  a  truncated  LL^  iteration  implemented  vithin  the  ABS  class  but 
using  the  sparsity  of  matrix 

A  few  of  her  results  with  this  code  are  given  in  the  following  table: 


Modified  Extended  Wood 

n  >  40 

CG 

412 

14962 

■f— 

ABS/CG 

137 

3218 

ABS/LL* 

73 

1967 

WSM 

n  -  100 

CG  failed 

to  converge  in 

1000  it. 

ABS/CG^ 

303 

11704 

161 

ABS/Lr 

130 

7917 

36 

Haany  Function 

n  -  150  K  -  3 

CG 

222 

300 

ABS/CG 

32 

47 

ABS/LL* 

51 

1651 

15 

n  -  300  K  -  3 

CG 

198 

52601 

115 

abs/ll’ 

75 

6405 

35 

n  -  900  K  -  3 

CG 

138 

116620 

595 

ABS/LL* 

83 

8149 

41 

Table  12  Use  of  the  ABS/LL^  code 

Because  of  the  sparsity  of  for  the  LL*  algorithm  no  difficulty  was 
experienced  in  storing  the  matrix  and  the  effectiveness  of  the  new 
approach  is  obvious. 


Analysis  of  the  results  does  however  indicate  that  if  the  use  of 
accurate  arithmetic  reduced  the  number  of  iterations  used  by  the  CG 
algorithm  to  the  number  required  by  the  ABS/CG  algorithm  then  the  time 
needed  by  the  CG  algorithm  would  be  lowest. 
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3.3  Results  using  the  Sequent  Balance  Systea 


To  obtain  results  using  automatic  differentiation  on  a  parallel 
processer  ve  needed  to  have  access  to  a  parallel  machine  that  could 
run  ADA  and  supported  concurrent  tasks.  The  Sequent  Balance  fulfilled 
this  criteria  and  Professor  Delves  allowed  us  access  to  his  machine  at 
Liverpool  University. 

Mohseninia  (1989)  implemented  automatic  differentiation  (sparse 
triplets)  and  the  truncated  Newton  method  on  this  system.  As  the 
Sequent  Balance  contains  far  fewer  processors  than  the  DAP  a  number  of 
elements  needed  to  be  allocated  to  each  processor.  Again  as  the 
processors  are  not  allocated  so  that  only  nearest  neighbour 
communication  is  possible,  the  effect  of  data  communication  is  more 
obvious . 

Typical  times  for  the  sparse  triplet  evaluation  of  f,  Vf,  9^ f  on  the 
Sequent  Mohseninia  (1989)  are  shown  below  for  the  Olsen  square  cavity 
problem. 


Elements 

Number  of 

Processors  Used 

a 

2 

3 

D 

5 

6 

a 

8 

9 

10 

8 

37 

20 

12 

10 

11 

12 

13 

8 

- 

- 

32 

156 

85 

62 

44 

38 

35 

26 

22 

24 

26 

128 

670 

352 

227 

175 

146 

124 

107 

92 

87 

81 

Table  13  Parallel  Sparse  Automatic  Differentiation 

These  results  were  really  very  satisfactory  and  indicated  that  dividing 
automatic  sparse  triplet  differentiation  into  concurrent  tasks  is 
effective. 

The  effect  of  data  communication  becomes  even  more  obvious  however  when 
considering  the  operations  required  within  the  truncated  Newton  code. 
The  dominant  operation  of  this  part  of  the  computation  is  the  product 
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of  the  sparse  matrix  V^f  with  the  search  direction  p,  and  the  cost  of 
this  multiplication  for  dimensions  512  and  1107  vithin  the  cavity 
driven  flow  problem  on  the  Sequent  Balance  vere 


Time 

Number  of  Processors 

B 

2 

3 

B 

5 

6 

7 

8 

9 

10 

n.5l2 

3.80 

2.40 

2.20 

2.10 

2.08 

2.10 

■w 

Q| 

2.4 

1107 

9.50 

6.00 

5.00 

4.5 

4.5 

4.3 

B 

D 

5.2 

Table  14  Parallel  Matrix  Vector  Multiplication 


These  results  are  really  quite  disappointing  and  indicate  that 
communication  costs  cannot  be  ignored  vhen  performing  linear  algebra  on 
parallel  systems. 

However  calculating  V^fp  is  much  less  expensive  than  forming  V^f  and 


using  the  figures  given  in  the  previous  section  and  assuming  P  >  8  ve 
have  as  an  overall  effect 

P  -  1 

P  .  8 

Time  in  AO  on  sequential  machine 

.84 

.12 

Time  in  linear  algebra 

.14 

.07 

Other  time 

.02 

.02 

1.00 

.21 

Giving  an  overall  speed  up  of  approximately  5. 
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3.4  Results  using  a  Transputer  Net 


At  about  this  time  the  Polytechnic  took  delivery  of  a  small  transputer 
network  and  ve  began  tests  to  discover  its  capabilities  and  to 
construct  a  model  of  its  behaviour. 

As  the  parallel  linear  algebra  had  proved  disappointing  on  the  Sequent, 
Jha  began  his  investigation  in  this  area  and  demonstrated  that  the 
operation  Av  could  not  be  performed  effectively  in  parallel  if  A  had  to 
be  downloaded  into  the  system  (see  table  15  below).  In  this  table  the 
measured  times  are  given  on  1,  2,  4  and  8  transputers  and  the  expected 
time  on  8  transputers  using  Jha's  model  of  the  communication  time  and 
computation  time  of  the  network. 


Number  of  Transputers 

Dim. 

1 

2 

4 

8 

Act. 

Exp. 

Act. 

Exp. 

Act. 

Exp. 

Act. 

Exp. 

Time 

Time 

Time 

Time 

Time 

Time 

Time 

Time 

16 

1 

- 

2 

- 

3 

- 

4 

- 

32 

5 

- 

7 

- 

14 

- 

17 

- 

64 

22 

- 

30 

- 

54 

- 

66 

- 

83 

34 

- 

46 

- 

84 

- 

103 

- 

96 

50 

- 

- 

120 

- 

147 

- 

128 

88 

89.6 

118.49 

213 

206.06 

255 

233.01 

Table  15  REAL  64  Matrix-vector  Multiplication 
Calculation  of  expected  timings  in  milliseconds 


However,  when  the  matrix  vector  multiplication  was  embedded  within 
Jacobi's  iterative  algorithm 


x**'  -  (I  -  A)  x*+  b  1 


An  effective  speed  up  was  obtained  (see  table  16). 
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Dim 

Number  of  Transputers 

1 

2 

4 

8 

No.  of 
itr. 

Time 

Time 

s.up 

Time 

s.up 

Time 

s.up 

16 

36 

49 

52 

0.94 

30 

1.63 

25 

1.96 

32 

39 

193 

197 

0.98 

109 

1.77 

75 

2.57 

64 

43 

819 

625 

1.31 

431 

1.90 

262 

3.13 

80 

42 

1236 

832 

1.46 

610 

2.03 

384 

3.22 

96 

42 

1768 

1097 

1.61 

760 

2.33 

549 

3.23 

128 

46 

3904 

2039 

1.92 

1248 

3.13 

839 

4.66 

Table  16.  Parallel  Jacobies  Method  on  Network  of  Transputers. 

The  difference  being  mainly  due  to  the  fact  that  A  only  had  to  be  down 
loaded  once.  The  results  obtained  vere  again  in  agreement  vith  the 
constructed  model >  it  should  perhaps  be  stressed  that  the  model 
predictions  are  dominated  by  the  communication  requirements  and  not  by 
the  computational  time. 


Dim 

Number  of  transputers 

1 

2 

4 

8 

Time 

Model 

Time 

Model 

Time 

Model 

Time 

Model 

128 

3904 

3894 

2039 

2096 

1248 

1220 

839 

834 

Table  17. 

Similar  good  speed  ups  vere  obtained  vhen  a  parallel  conjugate  gradient 
code  was  applied  to  a  full  matrix  vith  distinct  eigenvalues  (see  table 
18),  but  no  speed  up  could  be  obtained  by  implementing  a  sparse  matrix 
code  in  parallel  as  the  much  smaller  amount  of  computation  at  each 
iteration  vas  dominated  by  the  communication  costs. (Table  19). 
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(tatcis 


Tia* 


16 

1 

1 

32 

1 

1  358 

1 

64 

1 

1  3416 

1 

ao 

1 

1  8937 

1 

96 

1 

1  17786 

1 

128 

1 

1  39461 

1 

T 


Huabar  of  Traaaputarc 


T 


T 


Ztr 

Tlaa 

ftaUP 

Itr 

TiM 

s.up 

itr 

Tiaa 

■  •up 

itr 

26 

27 

1.67 

26 

19 

2.37 

26 

19 

2.37 

26 

63 

198 

1.81 

63 

116 

3.09 

62 

91 

3.93 

62 

167 

2013 

1.70 

167 

976 

3.50 

166 

621 

5.50 

163 

253 

4669 

1.91 

256 

2438 

3.67 

279 

1269 

7.042 

234 

355 

9128 

1.95 

355 

4242 

4.10 

347 

2829 

6.29 

388 

449 

21483 

1.84 

479 

10754 

3,«7 

513 

6188 

6.377 

513 

I 


Full  coda  varslon 

For  aatrix  A^A 

*  Tiaaa  ara  in  allliaaeeadi 
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Dim 

No.  of 
NNZ 

No.  of 
iter. 

Number 

of  transputers 

1 

2 

4 

Time 

Time 

s.  up 

Time 

s.  up 

16 

74 

15 

21 

22 

0.95 

25 

0.84 

32 

154 

25 

70 

69 

1.01 

75 

0.93 

64 

314 

38 

212 

202 

1.05 

214 

0.99 

80 

394 

43 

300 

283 

1.06 

298 

1.01 

96 

474 

47 

394 

369 

1.07 

387 

1.02 

128 

634 

55 

614 

572 

1.07 

599 

1.07 

IN 

FULL  MATRIX  CODE 

128 

634 

55 

5131 

2745 

1.87 

1498 

3.43 

Table  19.  Sparse  Conjugate  Gradient  vith  Matrix  of  distinct 
eigenvalues 

Eigenvalue  to  1,  n,  so  CN  ■  N. 

Sparse  code. 

*  Time  in  milliseconds 


From  our  experience  ve  would  stress  the  difficulty  of  getting  speed  up 
for  the  solution  of  sparse  linear  equations  on  parallel  systems,  and 
the  necessity  to  model  the  communication  time  carefully  when  predicting 
the  performance  of  parallel  algorithms. 


3.5  Concurrent  Developments  in  Parallel  Optimization 

In  this  section  an  attempt  will  be  made  to  put  the  methods  described  in 
the  previous  context  into  perspective  alongside  those  undertaken 
elsewhere. 

In  one  of  the  earliest  reviews  of  the  subject  of  parallel  computing  in 
optimization,  Schnabel  (1984),  identified  two  classes  of  optimization 
problems  that  would  benefit  from  parallel  processing,  namely. 
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(1)  large  problems  with  dimensions  of  more  than  100 

(2)  the  multi-extremal  global  optimization  problem. 

In  this  study  ve  are  concerned  with  the  former  problem. 

Turning  first  to  the  classic  situation  of  unconstrained  optimisation  in 
1984.  There  were  four  established  classes  of  sequential  algorithms. 

(1)  Modified  Newton  methods  in  which  the  Hessian  matrix  G  was 
calculated  and  then  a  modified  Newton  equation  of  the  form 

(G  +  iil)d  »  -g 

solved  by  Choleski  decomposition.  If  M  was  the  number  of  steps  needed 
to  calculate  the  function  value  then  each  iteration  in  the  days  before 
automatic  differentiation  involved  approximately 

VbCnXn+DM  +  l/6n* 

operations. 

(2)  Variable  metric  methods  in  which  either  an  approximation  B  to  the 
Hessian  matrix  G  was  updated  using  one  gradient  evaluation  or  an 
approximation  H  of  the  inverse  was  used.  As  methods  for  updating  the 
Choleski  decomposition  were  known  both  versions  could  be  performed  in 

(n+l)M  +  cn^ 

operations. 

As  it  was  found  that  these  methods  rarely  required  n  times  the  number 
of  iterations  than  the  modified  Newton  this  was  usually  preferred  if  n 
>  5. 

(3)  Conjugate  gradient  methods  of  the  Fletcher-Reeves  (1963)  or  Polak 
Ribiere  (1971)  type  that  had  the  property  that  they  would  minimise  a 
quadratic  function  in  a  number  of  iterations  equal  to  the  number  of 
distinct  eigenvalues. 


-  45  - 


At  each  iteration  these  methods  only  need 


(n-fl)M  +  c^n  operations 
and  became  standard  for  n  >  100. 

(4)  Direct  search  methods  vhere  the  gradient  or  Hessian  were  not 
evaluated  or  estimated.  The  most  successful  of  these  vas  undoubtedly 
the  Nelder  and  Mead  (1965)  Simplex  algorithm  which  regrettably 
frequently  converges  to  an  incorrect  point  if  n  >  5. 

(5)  As  mentioned  earlier  in  this  paper  this  situation  vas  dramatically 
altered  by  the  advent  of  the  truncated  Nevton  method  which  contained 
two  types  of  iteration. 

An  outer  iteration  requiring 

Vm(n-fl)M  operations 

and  a  much  more  frequent  inner  iteration  essentially  dominated  by  NNZ 
operations  vhere  NNZ  is  the  number  of  nonzero  in  G. 

This  method  vas  usually  more  efficient  than  any  of  the  others,  but 
naturally  each  still  has  its  advocates  and  parallel  versions  of  each 
have  been  investigated. 

For  instance  in  Lootsma  (1986)  a  similar  approach  to  Patel's  parallel 
modified  Nevton  method  is  described,  in  Straetter  (1973)  a  parallel 
variable  metric  algorithm  is  given,  while  many  authors  have  noted  that 
the  vector  operations  in  the  conjugate  gradient  algorithms  are  ideally 
suited  for  implementation  in  parallel.  Chazan  and  Miranker  (1970) 
described  an  early  parallel  direct  search  method  based  on  parallel 
directions,  an  idea  extended  in  Sutti  (1983). 

The  parallel  modified  Nevton  methods  typically  estimated  the  Hessian 
matrix  by  performing  %i(n-fl)  parallel  function  evaluations  or  n 
parallel  gradient  calculations  so  that  the  parallel  operation  count 
became 

M  +  c,n^  operations. 
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This  requirement  for  exactly  P  -  to(n-i-l)  or  P  «  n  processors  is  however 
very  restrictive  so  Byrd,  Schnabel  and  Shultz  (1988)  consider  methods 
that  used  parallel  gradient  evaluations  with  2  <  P  <  n  or  parallel 
function  evaluations  with  2n+l  <  P  <  (u^-<-  3n)/2.  In  their  paper  they 
argued  in  favour  of  using  the  unfactored  inverse  H  form  of  the  variable 
metric  algorithm  and  divided  its  rows  between  the  P  processors,  and 
either  updated  it  P  times  per  iteration  for  the  P  gradient  calculations 
.'erformed,  or  formed  P  correct  columns  of  the  matrix  using  this 
information;  further  details  of  this  method  are  also  given  in  Schnabel 
(1988). 

Van  Laarhoven  (1985)  applied  Street ters  ideas  for  parallel  variable 
metric  algorithms  to  the  larger  Huang  class  of  updating  formula.  Be 
showed  that  the  rank  one  update  is  the  only  one  which  provides  a 
parallel  extension  with  the  property  of  quadratic  termination,  in  the 
sense  that  n  updates  using  linearly  independent  directions  d^  produce  a 
matrix  that  is  equal  to  the  Hessian  or  its  inverse.  This  approach  was 
first  tested  on  large  problems  by  Dayde  (1989),  it  was  discussed 
further  in  Dayde,  Lescrenier  and  Toint  (1989)  who  conclude  that 
Newton's  method  using  finite  differences  of  the  gradient  to  estimate 
the  Hessian  is  much  more  efficient. 

The  problem  of  designing  a  direct  search  method  for  use  on  parallel 
computers  has  been  resolved  by  the  research  of  Torczon  (1989).  She 
showed  first  that  the  Nelder  &  Mead  algorithm  can  and  does  frequently 
fail  on  large  dimensions,  then  showed  how  to  design  a  sequential  direct 
search  method  that  theoretically  and  practically  converges  in  large 
dimensions  and  finally  constructed  a  parallel  version  that  has  a 
relative  speed  up  as  P  increases  (i.e.  speed  up  >  P).  This  result  is 
of  considerable  theoretical  importance.  It  is  the  only  method  that 
effectively  utilises  P  >  n^  processors,  but  has  the  disadvantage  that 
even  for  small  n  it  is  very  much  more  expensive  than  parallel  gradient 
methods.  Further  developments  of  this  research  are  given  in  Dennis  and 
Torczon  (1990). 

All  the  above  research  was  carried  out  before  automatic  differentiation 
became  available.  Using  sequential  automatic  differentiation  the  costs 
of  all  four  gradient  classes  reduce. 
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Before 


With 


Modified  Nevton 

Vin(n+1)M  +  l/6n’ 

qnM  +  l/6n 

Variable  Metric 

(n-«-l)N  +  cn* 

qM  *  cn* 

Conjugate  gradient 

(n+l)M  +  c^n 

qM  +  Cjn 

Truncated  Nevton  outer 

%n(n+l)M 

qnM 

inner 

NNZ 

NNZ 

Again  our  experience  is  that  the  truncated  Nevton  method  is  by  far  the 
most  effective.  In  this  report  ve  have  described  many  Implementations 
vhere  the  parallelism  is  used  vithin  the  function  evaluation 
effectively  reducing  M  to 


the  alternative  (approach  B)  use  of  parallelism  is  to  perform  the  n 
gradient  calculations  in  parallel  effectively  reducing  n  to 

n 

F  • 

This  has  been  implemented  by  Sofer  and  Nash  (1990)  who  also  give  a  nev 
parallel  equation  solver  thus  reducing  the  cost  of  the  inner 
iterations. 

Both  codes  are  very  effective.  Essentially  P  in  the  Sofer-Nash 
approach  is  limited  to  n  and  the  Dixon-Maany-Mohseninia  approach  to  K  a 
number  determined  by  the  partially  separable  nature  of  the  function. 

However,  the  two  parallelisms  are  complementary  and  if  sufficient 
processors  were  available  on  a  system  both  could  be  utilised  on  Kn 
processors. 
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4.  Constrained  Optinisation 


In  this  section  ve  will  consider  the  solution  of  constrained 
optimization  problems  on  parallel  processing  systems. 

The  problem  under  consideration  is  therefore, 

Min  f(x)  X  €  R" 

s.t.  e^(x)  -  0  i  -  1,  E  (4.1) 

hj(x)  >0  j  -  1,  ....  J 

An  early  approach  to  these  problems  involved  the  introduction  of 
penalty  or  barrier  functions.  The  simplest  penalty  function  being 
defined  as 


Pj(x,r)  -  f(x)  +  p 


E  J  ' 

f  e^(x)  +  (h.(x))f 

i.l  j.l  ^  j 


(4.2) 


where  the  (  )_indicates  that  the  term  is  only  included  when  hj(x)  <  0. 


A  barrier  function  method  is  restricted  to  inequality  constraints  and 
requires  a  feasible  starting  point.  The  barrier  function  is  usually 
defined  as 


B(x,r)  -  f(x)  -r  Y.  log(+h  (x))  (4.3) 

j  ^ 

In  the  original  algorithms  Fiacco  and  McCormick's  S.U.M.T.  method  was 
used  in  which  P^(x,r)  or  B(x,r)  was  minimised  for  a  sequence  of 
reducing  values  of  r.  The  starting  point  for  the  next  value  of  r  was 
determined  using  the  solution  at  previous  values,  so  that  the  change  in 
X  needed,  decreased  rapidly  with  r. 

It  was  later  realised  that  minimising  the  Penalty  (or  Barrier)  function 
exactly  at  each  value  of  r  was  unnecessary  and  then  recursive 
(sequential)  quadratic  programming  codes  became  popular.  These  utilise 
a  quadratic  approximation  to  the  Hessian  of  the  Lagrangian  and  linear 
approximations  to  the  constraints  and  decrease  r  at  each  (most) 
iterations.  In  most  of  these  codes  the  quadratic  approximation  to  the 
Hessian  is  updated  using  the  variable  metric  principle  so  they  require 
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accurate  estimates  of  gradient  of  the  function  and  the  Jacobian  of  the 
constraints.  Three  popular  algorithms  of  this  type  are 

OPXRQP  (Bartholomev-Biggs(1975)) 

VMCON  (Povell  (1977)) 

VMCVO  (Chamberlain  et  al  (1982)) 

These  algorithms  continue  to  perform  as  expected  if  analytic  first 
derivatives  are  replaced  by  doublet  automatic  differentiation  (Price 
(1987)).  At  about  the  same  time  ve  were  experiencing  difficulty 
solving  space  satellite  trajectory  optimization  problems  with  these 
codes  as  the  constraints  were  so  nonlinear,  so  Brown  (1986)  devised  a 
series  of  test  problems  with  highly  nonlinear  constraints.  On  these 
problems  the  recursive  quadratic  programming  codes  consistently  failed 
but  an  ODE  method  based  on  the  implicit  Euler  algorithm  (CONIMP)  was 
consistently  successful.  This  used  the  exact  second  derivatives  of 
both  objective  function  and  constraints. 

Given  the  success  of  the  truncated  Newton  method  at  solving 
unconstrained  problems.  Price  (1987)  experimented  with  solving  these 
problems  by  applying  that  algorithm  to  the  Di  Pillo  and  Grippo  (1979) 
exact  penalty  function 

P(x,X)  «  f  -  X*e  +  p  e’^e  +  P  v^v 

where  v  -  N^NX  -  N^'vf  (4.5) 

and  -  Se^/ax^. 

This  penalty  function  has  the  property  that  its  mininum  occurs  at  the 
solution  of  the  equality  constrained  problem  for  all  r  small  enough  and 
0  big  enough.  He  showed  that  if  the  Hessian  of  the  term  v^v  is 
approximated  in  the  least  squares  sense  by  2V^V 

*  ®v,/3x^ 

then  the  Hessian  of  P(x,X)  can  be  adequately  calculated  from  the  second 
derivatives  of  the  functions  and  constraints.  He  used  a  crude  early 
form  of  automatic  differentiation  to  obtain  his  results  but  the 
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algorithm  LARDPG  solved  all  of  the  eight  highly  nonlinearly  constrained 
problems  tested. 


Code 

Success 

Fail 

LARDPG 

8 

0 

CONIMP 

S 

0 

OPXRQP 

3 

5 

VMCON 

3 

5 

VMCVD 

3 

5 

Table  20.  A  Comparison  of  Success  Rates 
A  typical  problem  of  this  type  on  vhich  the  RQP  methods  fail  is  : 


Objective  function 
7 

i«l 

(1-Xi^j)^10.1[x^^j-1)%  (Xi^3-1)M  +  19.8(Xj^3-1)(x^^,-1) 

Constraints 


XjXjXjX^  ■  1.0 

XjXjXjX^XjXj  ■  1.0 

XiXjXjX^XjXjXtX,  - 

*1*2  *3  *4  *5  *6 ’‘7 


1.0 


10  “ 


1.0 


Initial  Point 


X,  -  (-2,  -%  3,  1/3,  -4,  ->4,  5,  1/5,  -6,  -1/6)* 
Solution 


X*  -  (1,  1,  1,  1,  1,  1,  1,  1,  1,  D* 
F(x*)  -  0.0 
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The  detailed  results  of  these  five  algorithms  on  this  problem  are  given 
below.  Here  NF  is  the  number  of  function/constraint  evaluations  and 
NGF  the  number  of (second] derivative  evaluations.  FV  the  function  value 
at  the  final  point  and  VC  the  constraint  violation. 


Code 

NF 

NGF 

FV 

VC 

CPU 

LARDPG 

60 

38 

6.8x10"* 

9x10"^* 

110.53 

CONIMP 

56 

332 

4.5x10"^’ 

2x10"^’ 

2.03 

OPXRQP 

34 

04 

3.13x10* 

5x10"^* 

0.54 

VMCON 

68 

02 

3.13x10* 

3.10"'® 

1.60 

VMCWD 

06 

06 

1.24x10* 

9x10^ 

1.17 

Table  21.  The  Relative  Performance  of  Various  Codes 


The  automatic  differentiation  of  the  function  and  constraints  could  now 
be  performed  with  sparse  triplet  arithmetic  or  even  by  reverse 
differentiation  and  the  Hessian  calculations  could  then  benefit  by  both 
being  performed  in  parallel  and  using  the  partially  separable  nature  of 
P(x,X)  (4.5),  so  the  results  of  the  LARPDG  algorithm  would  be  very  much 
faster. 


As  a  consequence  of  these  results  Bartholomew-Biggs  devised  an 
algorithm  OPALQP  (1987)  as  an  improvement  on  OPXRQP,  this  is  based  on 
the  function 


P(x,r) 


(  E 

'  '' 

*  J 

^  \ 

2n 

1  i.l 

ei(x)  -  X^ 

^  4 

+  51 

j-1 

0,hj(x)  -  1  X^ 

i  4 

4 

(4.6) 


where  X  denotes  the  current  estimate  of  the  Lagrange  Multipliers.  This 
algorithm  works  very  well  for  small  problems  and  is  based  on  the 
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iteration 


Ax  -  B"^(a’p  -  V£) 


where  (  f- 1  +  AB"^A*  j  p  -  AB'Wf  -  c  +  |-  X  . 


(4.7) 


An  analysis  of  the  behaviour  of  this  algorithm  shoved  that  as  the 
problem  size  increased  90X  of  the  computational  effort  within  the 
algorithm  was  spent  in  each  iteration  in  forming  the  product  AB~^A^ 
which  requires  2En^  operations.  This  is  a  large  overhead  as  the 
dimension  increases  and  one  that  can  not  really  be  speeded  up  by  doing 
parallel  operations  on  our  transputer  network,  due  to  the  large  amount 
of  communication  between  processors  that  would  be  required.  It  has 
been  our  experience  that  it  takes  3.7  times  as  long  to  send  one 
floating  point  number  between  processors  than  to  perform  one  numerical 
operation,  it  was  therefore  decided  not  to  attempt  to  run  OPALQP  in 
parallel  on  that  system.  It  would  of  course  be  a  very  efficient 
operation  to  do  in  parallel  on  a  machine  on  which  the  communications 
were  faster,  relative  to  the  arithmetic,  as  demonstrated  by  Conforti 
(1989). 


Having  rejected  OPALQP  for  implementation  on  the  transputer  network,  we 
looked  for  a  variant  designed  to  run  on  larger  problems  and  selected 
MINISH  Gould  (1986)  this  uses  the  simplest  penalty  function  (4.2)  and 
applies  the  S.U.H.T.  approach. 

In  particular  given  r^  and  r^^^^  s.t.  0  <  r^^^^  <  r^ 

set  r  ■  r^  then  UNTIL  r  <  r^^^^ 

Solve  Min  P(x,r) 

Set  r  -  r  X  10"* 

with  default  values  of  r^  and  r^^^^  being  10"^  and  10"^^  respectively 
and  the  minimisation  of  P(x,r)  being  terminated  when  IIVPII^  <  10'  ■^Vr. 
To  obtain  the  results  given  later  this  terminated  criteria  was  changed 
to  I IVPI  Ij  <  10‘^Vr. 

The  minimisation  is  performed  by  Newton's  method,  the  equations  we 
solved  were 
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il 


(4.8) 


which  is  of  course  larger  in  dimension  than  that  used  in  OPALQP,  but 
does  not  involve  the  large  matrix  multiplication.  In  our 
implementation  ve  used  the  BFGS  variable  metric  method  to  update  B 
dividing  this  by  rows  over  the  transputer  network  and  replaced  the 
direct  equation  solver  suggested  by  Gould  by  the  Iterative  CGS  method 
of  Sonneveld  (1986)  which  does  not  depend  on  the  matrix  being  symmetric 
or  positive  definite  and  only  uses  the  matrix  within  matrix/vector 
products  that  can  be  performed  in  parallel. 

Our  network  contained  up  to  14  transputers  and  was  configured  as  a  tree 
containing  one  transputer  directly  connected  to  the  root,  three 
transputers  in  the  next  level  and  nine  in  the  final  level,  As  each 
transputer  only  has  four  links  it  can  only  communicate  v  th  four  other 
transputers  and  in  our  tree  this  is  one  transputer  at  the  higher  level 
and  three  at  the  lower.  It  is  therefore  logical  to  perform  experiments 
with  P  a  4  and  14. 

4.2  Results 

Both  OPALQP  and  MINISH  were  implemented  in  3L  parallel  FORTRAN  and  run 
on  a  network  of  14  T800  transputers.  Both  algorithms  were  run  on  two 
test  problems: 

Problem  1 

k-2 

Subject  to  x^  ■  0  i  >  1,  ...,  k-1 

®1  -  -  ®i  +  l  ^  ^ 

where  *  1.0  +  1.01^”^  ,  1  £  1  S  k+1 
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The  following  results  are  for  each  algorithm  run  in  a  sequential 
fashion  on  a  single  transputer. 


Problem  1 

NF 

NG 

F 

Time  (Secs) 

OPALQP 

k-lO 

15 

13 

1.639x10"“ 

3.07 

k.20 

30 

28 

3.080x10"“ 

34.69 

k«30 

66 

45 

1.729x10"“ 

152.839 

MINISH 

k-10 

37 

27 

2.846X10"“ 

8.64 

k.20 

58 

43 

7.833X10"“ 

110.08 

k.30 

149 

71 

2.616x10"“ 

813.15 

Problem  2 

OPALQP 

k-lO 

28 

28 

550.0 

42.83 

k.20 

53 

46 

1050.0 

484.82 

k.30 

61 

55 

1550.0 

2121.36 

MINISB 

k.lO 

54 

29 

550.0 

108.79 

k.20 

81 

34 

1050.0 

847.32 

k.30 

95 

51 

1549.9 

4872.36 

Table  22.  Comparison  of  OPALQP  and  MINISH 


From  these  results  it  can  be  seen  that  OPALQP  solves  the  test  problems 
in  around  a  third  of  the  time  that  MINISB  takes.  No  doubt  due  to  the 
fact  the  MINISH  adopts  the  S.U.M.T.  strategy.  However  as  explained 
earlier  MINISH  has  more  scope  for  speed  up  when  adapted  for  use  on  a  . 
transputer  network.  It  was  not  felt  that  significant  improvement  could 
be  gained  by  parallelising  OPALQP  but  it  was  hoped  that  parallelising 
MINISH  would  make  it  more  competitive,  particularly  for  larger 
problems . 

MINISH  was  therefore  adapted  to  perform  the  BFGS  update  and  to  solve 
the  Newton  equations  in  parallel  on  a  network  of  four  and  then  14 
transputers.  These  adaptations  gave  the  following  results. 
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Problem  1 

NF 

NG 

F 

Time  (Secs) 

k-lO 

33 

27 

1. 050x10"“ 

6.94 

k.20 

58 

43 

7.844x10"“ 

61.29 

k-30 

149 

71 

2.616x10"“ 

379.30 

Problem  2 

k.lO 

54 

29 

550.0 

62.31 

k.20 

84 

35 

1050.0 

475.05 

k.30 

95 

51 

1549.9 

2317.56 

Table  23.  (4  Transputers) 


Problem  1 

NF 

NG 

F 

Time  (Secs) 

k«10 

33 

27 

2.819x10"“ 

9.24 

k*20 

59 

43 

8.248x10"“ 

74.43 

k.30 

149 

71 

2.616x10"“ 

342.62 

Problem  2 

k-lO 

54 

29 

550.0 

100.21 

k.20 

81 

34 

1050.0 

512.14 

k.30 

95 

51 

1549.9 

2179.43 

Table  24.  (14  Transputers) 


By  comparing  the  results  in  Tables  1  and  2,  it  can  be  seen  that  there 
vas  a  speed  up  factor  of  around  2  when  going  from  a  single  transputer 
to  4  transputers  on  both  problems. 

However  unfortunately  when  the  times  for  MINISH  on  four  transputers  are 
compared  to  the  times  for  OPALQP  on  a  single  transputer  given  in  Table 
1,  OPALQP  vas  still  quicker  on  each  version  of  the  problems. 


-  57  - 


On  a  14  transputer  network  for  values  of  k*10  and  20,  both  problems 
were  too  small  for  any  advantage  to  be  gained  from  parallelisation.  In 
fact  the  amount  of  communication  required  to  send  one  fourteenth  of  the 
data  around  the  network,  out-weighed  any  savings  in  the  computation. 
This  resulted  in  slower  times  on  14  transputers  than  on  4  transputers, 
on  both  problems,  except  when  k>30,  when  a  minor  speed  up  was  achieved 
on  both  problems. 

4.3  Conclusions 

The  modified  version  of  MINISH  described  above  would  work  very 
effectively  on  a  parallel  processing  machine  on  which  communication 
time  for  moving  floating  point  numbers  between  processors  is 
considerably  faster  than  the  time  needed  to  perform  an  arithmetic  step. 

The  same  might  be  true  of  a  modified  version  of  OPALQP  that  also 
utilised  CGS  and  performed  the  calculation  in  three  stages 

u  «  A^z,  y  -  V  >  Ay  with  updated  not  B.  Each  of  these 

multiplication  could  be  performed  efficiently  in  parallel  on  a  machine 
with  fast  communications. 

If  the  problem  has  very  nonlinear  constraints  then  the  minimisation  of 
4.5  is  preferable  as  it  overcomes  the  difficulties  due  to  the 
nonlinearities  and  can  be  done  effectively  by  the  truncated  Newton 
method. 

In  each  case  automatic  differentiation  can  be  performed  to  obtain  the 
necessary  derivatives.  This  can  utilise  both  the  parallelism  due  to 
partial  separability  and  if  the  second  derivatives  are  used  the 
parallelism  in  the  reverse  differentiation  stage. 
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5.  Pure  Speculation? 


In  section  2  ve  have  shown  that  any  function  calculation  may  be 
represented  by  a  task  graph  and  illustrated  this  concept  with  the 
graph  for  Rosenbrock's  function.  Ve  then  shoved  that  the  operations 
needed  for  reverse  automatic  differentiation  were  representable  by  a 
mirror  image  of  the  same  graph.  Again  ve  noted  that  these  graphs  were 
linked  by  the  nonlinearities  in  the  objective  function  and  that  a 
reduced  graph  could  be  used  to  link  the  coordinate  values  with  the 
gradient  values  and  that  by  extending  this  to  Vf^d  and  reversing  this 
graph  ve  obtain  a  graph  for  V^fd. 

The  proposal  that  this  graph  might  be  used  to  calculate  the  Nevton 
step  without  forming  the  Hessian  appears  in  Grievank  (1989b}. 

Let  us  consider  what  might  be  involved. 

Ve  have  seen  that  for  Rosenbrock's  function  the  operations  and  the 
task  graph  are  as  given  in  Figure  6. 


Figure  6.  Task  graph  of  Rosenbrocks  Function 

and  that  using  reverse  differentiation  ve  can  calculate  the  gradients 
as  shown  in  Figure  7. 
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Figure  7.  Reverse  task  graph  for  the  gradient  vector 

It  will  be  noted  that  the  elementary  operations  of  add,  subtract  and 
multiply  appearing  in  £  contribute  a  fork  in  V£  vhile  those 
appearing  nonlinearly  in  f  also  appear  in  7f.  Ve  note  that  until 

these  appear  in  the  gradient  tree  the  x^  are  constant  and 
independent  of  x. 

Nov  linking  these  two  graphs  together  through  the  nonlinearities  ve 
obtain  Figure  8. 


Figure  8.  A  direct  task  graph  for  the  gradient  vector 
This  emphasises  the  symmetric  nature  of  the  graph. 
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If  ve  now  perform  reverse  differentiation  on  this  graph  extended  to 

Xjj  ■  djXj  +  djXj  and  representing  the  variables  in  the  graph  of  ^fd 
by  and  y^  when  it  corresponds  to  x^  and  x^  respectively  we  obtain 
Figure  9. 


where 


Ve  note  that  the  structure  for  this  graph  is  identical  to  that  for  Vf 
but  that  the  operations  at  the  nodes  are  slightly  different.  In 

looking  for  the  Newton  step  of  this  problem  we  may  treat  any  x^,  x^ 
remaining  in  the  calculation  as  constants  and  can  now  represent  the 
graph  as  a  sparse  matrix.  By  ordering  the  variables  as 


yi  ^2  ^3  y*  y-,  yi  y*  y^ 

we  obtain  the  sparse  system. 
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1 

yi 

‘^1  ' 

1 

y2 

-2Xj  1 

y^ 

0 

+1  -1  1 

y* 

0 

+1  1 

yi 

0 

-2x,  1 

yi 

0 

-2x5  1 

^4 

0 

-1  1 

0 

+1  1 

y2 

0 

-2X3  +1  *2X3  1 

.yi  . 

0 

Figure  10.  The  Hessian  Task  matrix  of  Rosenbrock's  function 


This  representation  stresses  the  lover  triangular  form  of  the  sparse 
matrix  and  its  symmetry  about  the  antidiagonal  which  appears  to  be  a 
property  of  all  "Hessian  Task"  matrices. 

The  forward  calculation  is:- 

given  d  calculate  the  value  of  V^fd  -  (yiy2)* 

This  is  obviously  straight  forward  from  the  above  matrix. 

The  Newton  calculation  in  contrast  is:- 

given  y^  «  -g^  and  y,  -  -gji 

calculate  the  values  of  d^  and  dj. 

To  formulate  this  calculation  in  matrix  terms  it  is  easiest  to  augment 
the  above  matrix  with  a  further  n  rows  and  columns  (so  remembering 
n  >  2  for  Rosenbrock's  function)  we  obtain  the  matrix  shown  in  Figure 
11. 
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This  matrix  is  an  echelon  form,  with  echelon  index  (El)  of  n  >  2, 
which  ve  may  write  as 


i.e.  Lv  +  Bd  «  0  -♦  v  ■  -L~^Bd 

Cv  .  g 

CL'^Bd  -  -g 

So  by  implication  f  «  CL~^B  i.e.  the  Hessian  Matrix  is  the  Shur 
Complement  of  the  augmented  matrix. 

In  Dixon  &  Maany  (1987)  the  operations  needed  to  solve  an  echelon 
matrix  of  this  type  were  analysed  and  shown  to  be  dominated  by 

EI(NNZ)  +  1/6(EI)’ 
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the  first  term  being  the  calculation  of  the  Shur  complement  by  El 
parallel  steps  and  the  second  the  solution  of  the  Shur  Complement 
equation  which  by  definition  is  symmetric.  If  El  >  n  this  operations 
count  is  precisely  what  would  be  expected  to  first  form  and  then  solve 
the  Hessian  matrix. 

However  if  we  could  sort  the  augmented  matrix  into  an  echelon  form 
with  El  <  n  the  operational  cost  would  be  greatly  reduced.  Dixon  & 
Haany  (1987)  also  describe  a  heuristic  sort  aimed  at  finding  an 
equivalent  matrix  with  a  low  El. 

In  that  paper  they  applied  their  echelon  method  to  the  non-symmetric 
Grenoble  matrices  in  the  Harwell  set  obtainable  from  Iain  Duff  and 
found  that  the  echelon  indices  obtained  were  remarkably  small. 

Grenoble  n  115  185  216  343  512  1107 

Matrices  El  31  58  18  49  32  83 

The  formation  of  the  Shur  complement  for  these  sorted  matrices 
followed  by  solving  the  equations  using  its  inverse  was  very 
efficient.  A  comparison  was  performed  against  a  conjugate  gradient 
algorithm  with  and  without  diagonal  preconditioning,  and  also  with 
MA28  and  F04QAF. 


Nonsymmetric 

Grenoble 

Problem 

CG 

PCG 

MA28 

P04QAF 

Echelon 

Method 

115 

3 

3 

0.4 

3 

0.8 

185 

29 

28 

2.3 

32 

2.6 

216  A 

11 

12 

1.8 

9 

0.8 

216  B 

87  F 

86  F 

1.4  F 

0.9  F 

0.9  F 

343 

27 

28 

4.1 

15.6 

3.28 

512 

54 

56 

10.3 

32.4 

3.38 

1107 

521  F 

519  F 

133  F 

684  F 

20.97 

Table  25.  The  performance  of  the  echelon  method  on  the  Grenoble  test 
set. 
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Ve  vere  quite  excited  with  these  results  but  unfortunately  the  method 
need  not  always  be  stable  (Duff  &  Reid,  private  communication). 

Indeed  our  results  indicated  it  was  always  unstable  if  the  original 
matrix  is  symmetric  and  positive  definite.  It  did  indeed  fail  on  all 
the  symmetric  versions  of  the  Grenoble  problem;  we  did  however 
complete  our  tests  on  these  problems  and  for  interest  the  results  are 
given  below. 


Symmetric 

Grenoble 

Problem 

CG 

PCG 

ILU/ 

CG 

MA28 

F04QAF 

115 

1.61 

1.64 

1.51 

0.88 

1.11 

216 

11.27 

6.77 

4.04 

5.2 

5.47 

343 

13.57 

9.39 

7.38 

26.24 

7.19 

512 

24.90 

15.80 

12.01 

64.3 

12.77 

1107 

59.15 

37.64 

31.55 

S 

25.04 

Table  26.  A  comparison  of  the  relative  performance  of  a  number  of 
codes  on  the  symmetric  Grenoble  problem 


Here  S  indicates  that  too  much  store  was  required  for  the  system  being 
used. 


Returning  now  to  the  task  matrix  of  Rosenbrock's  function  the  natural 
form  of  which  already  has  an  El  -  2,  we  found  to  our  surprise  that  our 
echelon  sort  found  an  ordering  with  El  >  1. 


•  " 

■ 

• 

1 

1 

yi 

1 

1 

I 

yj 

-82 

1  ' 

1  2x, 

yi 

-1  1 

1 

<*1 

1 

1  1 

1 

y7 

-25,  1 

y7 

1  1 

1 

y4 

0 

1  ' 

y4 

25-  1 

*  1  1  -1 

1 

yj 

1 

-1  1 

1 

d. 

1 

1  1 

-  _  j 

y3 

1  -25,  1  -2xj 

ys 

Figure  12.  The  resorted  form  of  the  Nevton  Task  graph  equations  of 
Rosenbrock's  function 


Ve  note  that  if  ve  solve  this  by  the  Shur  Complement  method  ve  reduce 
both  parts  of  the  cost.  Ve  note  too  that  it  is  unstable  near  0. 

Effectively  this  matrix  implies  that  if  ve  take  y,  as  the  independent 
variable  then  ve  can  compute  all  the  variables  and  are  left  vith  one 
equation,  that  at  node  y^ ,  to  determine  y, .  Note  ve  could  interchange 
the  last  tvo  rovs  of  the  permutated  matrix  and  end  vith  node  y, 
determining  the  value  of  vhich  seems  to  preserve  the  symmetry 
better. 

Preliminary  investigation  indicates  that  an  echelon  sort  can  reduce 
the  echelon  index  of  the  augmented  task  graph  belov  n  and  thus  reduce 
the  cost  of  solving  the  task  graph  belov  that  of  forming  and  solving 
the  Hessian  matrix. 
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Conclusions 


In  this  report  ve  have  demonstrated 

(1)  that  the  use  of  defined  data  types  and  operator  overlays  in  Ada 
allows  optimisation  codes  to  be  written  so  that  they  reflect  the  basic 
algorithm. 

(2)  that  introducing  sparse  doublet  and  sparse  triplet  data  types 
enables  both  the  gradient  and  Hessian  of  partially  separable  objective 
functions  to  be  calculated  very  efficiently.  This  result  extends 
naturally  to  the  Jacobian  of  constraints  and  if  necessary  to  their 
Hessians. 

(3)  that  the  use  of  automatic  differentiation  combines  naturally  with 
the  truncated  Newton  method  to  efficiently  solve  large  unconstrained 
optimization  algorithms. 

(4)  that  the  use  of  "accurate"  dot  products  and  matrix  vector 
products  greatly  speeds  up  large  scale  optimization. 

(5)  that  automatic  differentiation  of  partially  separable  functions 
can  be  performed  effectively  using  concurrent  tasking  in  Ada  on  the 
Sequent  Balance. 

(6)  that  to  perform  the  linear  algebra  within  the  truncated  Newton 
algorithm  effectively,  in  parallel,  requires  a  machine  with  faster 
communications  relative  to  computation  than  that  available  on  the 
Sequence  Balance  or  the  transputer  network. 

(7)  that  the  linear  algebra  and  truncated  Newton  method  performed 
effectively  on  the  ICL/DAP  machine  with  4096  processors. 

(8)  that  automatic  differentiation  combines  effectively  with  RQP 
algorithms  for  the  sequential  solution  of  contrained  problems. 

(9)  that  when  the  constraints  are  too  nonlinear  for  RQP  algorithms  to 
be  effective  automatic  differentiation  combined  with  a  truncated 
Newton  algorithm  can  obtain  the  solution  when  applied  to  the  Di  Pillo 
Grippo  exact  penalty  function. 


(10)  that  the  linear  algebra  within  Gould's  MINISH  algorithm  can  be 
simply  divided  into  parallel  tasks  and  should  be  effective  on  a 
machine  with  a  faster  communication  time  than  our  transputer  network. 

(11)  that  the  use  of  reverse  differentiation  should  be  even  faster 
than  sparse  forward  automatic  differentiation  for  calculating  first 
derivatives,  and  can  be  performed  in  parallel  to  calculate  second 
derivatives,  and  may  lead  to  techniques  that  can  calculate  the  Newton 
step  cheaper  than  the  Hessian  matrix. 
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